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Abstract. Kernel methods are powerful tools in machine learning. They 
have to be computationally efficient. In this paper, we present a novel 
Geometric-based approach to compute efficiently the string subsequence 
kernel (SSK). Our main idea is that the SSK computation reduces to 
range query problem. We started by the construction of a match list 
L(s,t) = {( i,j ) : Si = tj} where s and t are the strings to be compared; 
such match list contains only the required data that contribute to the 
result. To compute efficiently the SSK, we extended the layered range 
tree data structure to a layered range sum tree, a range-aggregation data 
structure. The whole process takes 0(p\L\ log \L\) time and 0(|L| log \L\) 
space, where \L\ is the size of the match list and p is the length of 
the SSK. We present empiric evaluations of our approach against the 
dynamic and the sparse programming approaches both on synthetically 
generated data and on newswire article data. Such experiments show 
the efficiency of our approach for large alphabet size except for very 
short strings. Moreover, compared to the sparse dynamic approach, the 
proposed approach outperforms absolutely for long strings. 

Keywords: string subsequence kernel, computational geometry, layered 
range tree, range query, range sum 


1 Introduction 

Kernel methods [1] offer an alternative solution to the limitation of traditional 
machine learning algorithms, applied solely on linear separable problems. They 
map data into a high dimensional feature space where we can apply linear learning 
machines based on algebra, geometry and statistics. Hence, we may discover 
non-linear relations. Moreover, kernel methods enable other data type processings 
(biosequences, images, graphs, ...). 

Strings are among the important data types. Therefore, machine learning 
community devotes a great effort of research to string kernels, which are widely 
used in the fields of bioinformatics and natural language processing. The phi¬ 
losophy of all string kernels can be reduced to different ways to count common 

* This work is supported by the MESRS - Algeria under Project 8/U03/7015. 
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substrings or subsequences that occur in both strings to be compared, say s and 

t. 

In the literature, there are two main approaches to improve the computation 
of the SSK. The first one is based on dynamic programming; Lodhi et al. m apply 
dynamic programming paradigm to the suffix version of the SSK. They achieve 
a complexity of 0(p|s||t|), where p is the length of the SSK. Later, Rousu and 
Shawe-Taylor [7. propose an improvement to the dynamic programming approach 
based on the observation that most entries of the dynamic programming matrix 
(DP) do not really contribute to the result. They use a set of match lists combined 
with a sum range tree. They achieve a complexity of 0(p\L\ logmin(|s|, |t|)), where 
L is the set of matches of characters in the two strings. Beyond the dynamic 
programming paradigm, the trie-based approach WM is based on depth first 
traversal on an implicit trie data structure. The idea is that each node in the 
trie corresponds to a co-occurrence between strings. But the number of gaps is 
restricted, so the computation is approximate. 

Motivated by the efficiency of the computation, a key property of kernel 
methods, in this paper we focus on improving the SSK computation. Our main 
idea consists to map a machine learning problem on a computational geometry 
one. Precisely, our geometric-based SSK computation reduces to 2-dimensional 
range queries on a layered range sum tree (a layered range tree that we have 
extended to a range-aggregate data structure). We started by the construction 
of a match list L(s,t) = {( i,j ) : .s ; ; = tj} where s and t are the strings to be 
compared; such match list contains only the required data that contribute to the 
result. To compute efficiently the SSK, we constructed a layered range sum tree 
and applied the corresponding computational geometry algorithms. The overall 
time complexity is 0(p\L\ log |L|), where \L\ is the size of the match list. 

The rest of this paper is organized as follows. Section [2] deals with some 
concept definitions and introduces the layered range tree data structure. In 
section [3] we recall formally the SSK computation. We also review three efficient 
computations of the SSK, namely, dynamic programming, trie-based and sparse 
dynamic programming approaches. Our contribution is addressed in Section [4] 
Section |5]presents the conducted experiments and discusses the associated results, 
demonstrating the practicality of our approach for large alphabet sizes. Section [6] 
presents conclusions and further work. 

2 Preliminaries 

We first deal with concepts of string, substring, subsequence and kernel. We then 
present the layered range tree data structure. 

2.1 String 

Let £ be an alphabet of a finite set of symbols. We denote the number of symbols 
in £ by |I7|. A string s = si...S| s i is a finite sequence of symbols of length |s| 
where Si marks the i th element of s. The symbol e denotes the empty string. We 
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use S n to denote the set of all finite strings of length n and S* the set of all 
strings. The notation [s = t] is a boolean function that returns 

{ 1 if s and t are identical; 

0 otherwise. 


2.2 Substring 

For 1 < i < j < |s|, the string s(i : j) denotes the substring SjSj + i...Sj of s. 
Accordingly, a string t is a substring of a string s if there are strings u and v such 
that s = utv ( u and v can be empty). The substrings of length n are referred to 
as n-grams (or n-mers). 

2.3 Subsequence 

The string t is a subsequence of s if there exists an increasing sequence of indices 
I = (*i, in s, (1 < ii < ... < i\ t \ < |s|) such that tj = for j = 1,..., |t|. 

In the literature, we use t = s(J) if t is a subsequence of s in the positions given 
by I. The empty string e is indexed by the empty tuple. The absolute value 
f denotes the length of the subsequence t which is the number of indices |/|, 
while 1(1) = i\ t \ — ii + 1 refers to the number of characters of s covered by the 
subsequence t. 

2.4 Kernel methods 

Traditional machine learning and statistic algorithms have been focused on 
linearly separable problems (i.e. detecting linear relations between data). It is 
the case where data can be represented by a single row of table. However, real 
world data analysis requires non linear methods. In this case, the target concept 
cannot be expressed as simple linear combinations of the given attributes [J. 
This was highlighted in 1960 by Minsky and Papert. 

Kernel methods were proposed as a solution by embedding the data in a 
high dimensional feature space where linear learning machines based on algebra, 
geometry and statistics can be applied. This embedding is called kernel. It arises 
as a similarity measure (inner product) in a high dimension space so-called feature 
description. 

The trick is to be able to compute this inner product directly from the original 
data space using the kernel function. This can be formally clarified as follows: 
the kernel function K corresponds to the inner product in a feature space F via 
a map (f). 


(j) : X F 
x i ^ 4>(x) 

K(x,x') = (cj)(x),<j>(x')). 
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2.5 Layered Range Tree 

A Layered Range Tree (LRT) is a spatial data structure that supports orthogonal 
range queries. It is judicious to describe a 2-dimensional range tree inorder to 
understand LRT. Consider a set S of points in 1Z 2 . A range tree is primarily a 
balanced binary search tree (BBST) built on the ^’-coordinate of the points of S. 
Data are stored in the leaves only. Every node v in the BBST is augmented by 
an associated data structure (' T aS soc{v )) whitch is a 1-dimensional range tree, it 
can be a BBST or a sorted array, of a canonical subset P(v) on ^-coordinates. 
The subset P(v) is the points stored in the leaves of the sub tree rooted at 
the node v. Figure [T] depicts a 2-dimensional range tree for a set of points 
S = {(2,2), (5, 2), (3, 3), (4, 3), (2,4), (5,4)}. In the case where two points have 
the same x or y-coordinate, we have to define a total order by using a lexicographic 
one. It consists to replace the real number by a composite-number space [2;- The 
composite number of two reals x and y is denoted by {x\y), so for two points, we 
have: 

(a-|y) < (x'\y') 4=> x < x' V (x = x' A y < y'). 

Based on the analysis of computational geometry algorithms, our 2-dimensional 



O The tree with circle nodes is a BBST on ^-coordinates. 

□ The arrays with ^-coordinates play the role of the associated BBST. 


Fig. 1 . A 2-dimensional range tree. 


range tree for a set S of n points requires 0(n log n) storage and can be constructed 
in 0(n log n) time. 

The range search problem consists to find all the points of S that satisfy a range 
query. A useful idea, in terms of efficiency, consists on treating a rectangular range 
query as a two nested 1-dimensional queries. In other words, let [x\ : X 2 ] x [yi : 7 / 2 ] 
be a 2-dimensional range query, we first ask for the points with ^-coordinates 
in the given 1-dimensional range query [x\ : £ 2 ]. Consequently, we select a 
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collection of O(logn) subtrees. We consider only the canonical subset of the 
resulted subtrees, which contains, exactly, the points that lies in the x-range 
[x\ : X 2 ]. At the next step, we will only consider the points that fall in the y -range 
[2/1 : 2/2] • 

The total task of a range query can be performed in 0( log 2 n + k) time, where k 
is the number of points that are in the range. We can improve it by enhancing 
the 2-dimensional range tree with the fractional cascading technique which is 
described in the following paragraph. 

The key observation made during the invocation of a rectangular range 
query is that we have to search the same range [ y\ : 1 / 2 ] in the associated 
structures of O(logn) nodes found while querying the main BBST by the range 
query [xi : X 2 ]. Moreover, there exists an inclusion relationship between these 
associated structures. The goal of the fractional cascading consists on executing 
the binary search only once and use the result to speed up other searches without 
expanding the storage by more than a constant factor. 

The application of the fractional cascading technique introduced by [3] on a 
range tree creates a new data structure so called layered range tree. The technique 
consists to add pointers from the entries of an associated data structure T asS oc of 
some level to the entries of an associated data structure below, say T'assoc as 
follows: If T aS soc[i\ stores a value with the key yi, then we store a pointer to the 
entry in T’assoc with the smallest key larger than or equal yi- We illustrate such 
technique in Fig. [ 2 ] for the same set represented in Fig. [T] Using this technique, 
the rectangular search query time becomes 0(logn + k ), O(logn) for the first 
binary search and 0(k) for browsing the k reported points. 



O The tree with circle nodes is a BBST on rr-coordinates. 

I~1 The arrays with ^/-coordinates play the role of the associated BBST. 
—» The pointers of the fractional cascading (null pointers are omited). 


Fig. 2. A layered range tree, an illustration of the fractional cascading (only between 
two levels). 
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3 String Subsequence Kernels 

The philosophy of all string kernel approaches can be reduced to different ways 
to count common substrings or subsequences that occur in the two strings to 
compare. This philosophy is manifested in two steps: 

— Project the strings over an alphabet E to a high dimension vector space F, 
where the coordinates are indexed by a subset of the input space. 

— Compute the distance (inner product) between strings in F. Such distance 
reflects their similarity. 

For the String Subsequence Kernel (SSK) [2], the main idea is to compare strings 
depending on common subsequences they contain. Hence, the more similar strings 
are ones that have the more common subsequences. However, a new weighting 
method is adopted. It reflects the degree of contiguity of the subsequence in 
the string. In order to measure the distance of non contiguous elements of the 
subsequence, a gap penalty A s]0,1] is introduced. Formally, the mapping function 
</> p (s) in the feature space F can be defined as follows: 

r u (s)= E 

I:u=s(I ) 


The associated kernel can be written as: 

K p ( S ,t) = 

= E #(«)■#(*) 

_ ^ ^ ^ ) ^Z(J)+Z(J) 

u£Up I:u—s{I ) J:u=t(J) 


In order to clarify the idea of the SSK, we present a widespread example in the 
literature. Consider the strings bar , bat , car and cat , for a subsequence length 
p = 2, the mapping to the feature space is as follows: 


<t>i 

ar 

at 

ba 

br 

bt 

ca 

cr 

ct 

bar 

A“ 

0 

A 2 

A d 

0 

0 

0 

0 

bat 

0 

A 2 

A 2 

0 

A 3 

0 

0 

0 

car 

A 2 

0 

0 

0 

0 

A 2 

A 3 

0 

cat 

0 

A 2 

0 

0 

0 

A 2 

0 

A 3 


The unnormalized kernel between bar and bat is A' 2 (bar, bat) = A 4 , while the 
normalized version is obtained by : 


K 2 (bar,bat) = K 2 (bar, bat) / \JK 2 (bar, bar).K 2 (bat, bat) = A 4 /(2A 4 +A 6 ) = l/(2+A 2 ). 
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A direct implementation of this kernel leads to 0(\S P \) time and space complexity. 
Since this is the dimension of the feature space. To assist the computation of the 
SSK a Suffix Kernel is defined through the embedding given by: 

W s (s)= * KI) ,ueZ p , 

/e4 s| :w=s(J) 

where I p denotes the set of p-tuples of indices I with i p = k. In other words, we 
consider only the subsequences of length p that the last symbol is identical to 
the last one of the string s. The associated kernel can be defined as follows: 


KS( S ,t) = (r’ s (s),r’ s (t)) 

= e r u ’ s (s)-r u ’ s (t). 

To illustrate this kernel counting trick, we take back the precedent example where 
the mapping is as follows: 


TT,15 

ar 

at 

ba 

br 

bt 

ca 

cr 

ct 

bar 

A 2 

0 

0 

A J 

0 

0 

0 

0 

bat 

0 

A 2 

0 

0 

A 3 

0 

0 

0 

car 

A 2 

0 

0 

0 

0 

0 

A 3 

0 

cat 

0 

A 2 

0 

0 

0 

0 

0 

A 3 


for example K%(bar, bat ) = 0 and K.f (bat, cat) = A 4 . 

The SSK can be expressed in terms of its suffix version as: 

Id 1*1 

K p (s,t) = (*( 1 : O.^ 1 : J'))- (!) 

i=1 i=i 

with K?(s,t) = [s H = t| t |] A 2 . 

3.1 Naive Implementation 

The computation of the similarity of two strings (sa and tb) is conditioned by 
their final symbols. In the case where a = b, we have to sum kernels of all prefixes 
of s and t. Hence, a recursion has to be devised: 

Id Id 

K*(sa,tb) = (a = b]Y / Y, x2+lsl ~ i+ltl ~ jK P-M 1 ■ 0 ,*(! ■ J))- ( 2 ) 

*=1 3 =1 

This computation leads to a complexity of 0(p(|s| 2 |f | 2 )). 
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3.2 Efficient Implementations 

We present three methods that compute the SSK efficiently, namely the dynamic 
programming [Bj, the trie-based E0B3 and the sparse dynamic programming 
approaches . 

To describe such approaches, we use two strings s = gatta and t = cata as a 
running example. 


Dynamic Programming Approach. The starting point of the dynamic pro¬ 
gramming approach is the suffix recursion given by equation ([2]). From this 
equation, we can consider a separate dynamic programming table DP p for storing 
the double sum: 

k i 

DPp(k,l ) = ^^^'^(5(1 : i),t(l:j)). (3) 

i=l j =1 

It is easy to see that: K p (sa,tb) = [a = b] X 2 DP p (\s\, |f|)). 

Computing ordinary DP p for each (fc, l) would be inefficient. So we can devise a 
recursive version of equation ([3]) with a simple counting device: 

DP p (k,l ) = i\p_ 1 (s(l : k),t( 1 : l)) + XDP p (k-l,l) + 

XDP p {k , l- 1) - A 2 DP p {k -1,1-1). 

Consequently, using the dynamic programming approach (Algorithm [lj , the 
complexity of the SSK becomes 0(p|s||t|). 

Table [T] illustrates the computation of the dynamic programming tables for 
the running example for p = 1,2. The evaluation of the kernel is given by the 
sum of entries of the suffix table KPS: 

Ki(s, t) = 6A 2 . 

K 2 {s,t) = 2A 4 + 2A 5 +A 7 . 


Trie-based Approach. This approach is based on search trees known as tries, 
introduced by E. Fredkin in 1960. The key idea of the trie-based approach is that 
leaves play the role of the feature space indexed by the set 2J P of strings of length 
p. In the literature, there are variants of trie-based string subsequence kernels. 
For instance the ( p, m)-mismatch string kernel [5\ and restricted SSK [8]. In the 
present section, we try to describe a trie-based SSK presented in 7] that slightly 
differ from those cited above |5l8j . Figure [3] illustrates the trie data structure 
for the running example. Each node in the trie corresponds to a co-occurrence 
between strings. The algorithm maintains for all matches u = s(I) = U\ ■ ■ ■ u q , 
I = i 1 ■ ■ • i q a list of alive matches A s ( u , g) as presented in Table [ 2 ] that records 
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Algorithm 1: Dynamic SSK computation 
Input: Strings s and t, the length of the subsequence p, and the decay penalty A 
Output: Kernel values K q (s, t) = K(q) : q = 1,... ,p 

1 m <— length(s) 

2 n «— lengt.h(t) 

3 A'(l : p) <— 0 

/* Computation of K\(s,t) */ 

4 for i = l:m do 


5 

for j 

= l:n do 

6 


if s[i] = tfjj then 

7 



KPS[i,j] A 2 

8 



K[l]<-K[l] + KPS[i,j] 


9 

10 

11 

12 

13 

14 

15 


/* Computation of K q (s, t) : q = 2,... ,p 
for q = 2:p do 
for i = l:m do 
for j = l:n do 

DP[i,j] «- KPS[i,j] + \DP[i-l,j] + \DP[i,j-l] 
if s[i] = t[j] then 

KPS[i,j] <- A 2 DP[i - 1 ,j- 1] 

L K[q]<-K[q]+KPS[i,j\ 


*/ 


X 2 DP[i-l,j-l] 


the last index i q where g = 1(1) — |/| is the number of gaps in the match. Notice 
that in the same list we are able to record many occurrences with different gaps. 
Alive lists for longer matches uc, c £ £, can be constructed incrementally by 
extending the alive list corresponding to u. Similarly, the algorithm is applied to 
the string t. The process will continue until achieving the depth p. For efficiency 
reasons, we need to restrict the number of gaps to a given integer g max , so the 
computation is approximate. The kernel is evaluated as follows: 


K p (s,t) 


E mm = E ^ +p \m9s)\-x 9t+p \Lt(u, 9t )\. 

uGSP g s ,9t 


Given that, there are ( p + 9ma possible combinations to assign p letters and 
gmax gaps in a window of length p + g ma x, the worst-case time complexity of the 
algorithm is 0(( p + 9maa: ) (|s| + |f|)). 

The string subsequence kernel for the running example for p = 1 is: 


(s,i) = A 0+1 |A s (’a’,0)| • A 0+1 |A t (’a’,0)| + A 0+1 |A s (’t’,0)| • A 0+1 |A t (V,0)| 
= 4 • A 2 + 2 • A 2 = 6 • A 2 . 
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Table 1. Suffix tables and dynamic programing tables to compute the SSK for p = 1,2. 


KPS i 

g 

a 

t 

t 

a 

c 






a 


A 2 



A 2 

t 



A 2 

A 2 


a 


A 2 



A 2 

dp 2 

g 

a 

t 

t 

a 

C 

0 

0 

0 

0 


a 

0 

A 2 

A 3 

A 4 


t 

0 

A 3 

a 2 + a 4 

A 2 + A 3 + A 5 


a 






kps 2 

g 

a 

t 

t 

a 

C 






a 






t 


A 4 


A 5 


a 





A 4 + A 5 + A 7 



Fig. 3. The trie data structure for the running example s = gatta, t = cata 


Similar computation is performed for I\ 2 and K 3 : 

K 2 (s,t) = (1 ■ A 2+2 ) • (1 • A 1+2 ) + (1 ■ A 0+2 + 1 • A 1+2 ) • (1 ■ A 0+2 ) + (1 • A 0+2 + 1 • A 1+2 ) • (1 • A 0+2 ) 
= A 7 + 2-A 5 + 2-A 4 . 


and 


K 3 {s,t) = (2 • A 1+3 ) • (1 • A 0+3 ) = 2 • A 7 . 
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Table 2. The alive indices for all subsequences of the running example for p = 1,2,3 
with the number of gaps from 0 to 3. 


g 

A s (’a \g) 

At(’a’,g) 

-AsOt’, g) 

At(’t’, g) 

A s (’aa’,fl) 

At(’aa’,g) 

0 

{2,5} 

{2,4} 

{3,4} 

{3} 



1 






{4} 

2 





{5} 


3 







g 

A s (’at’,g) 

At(’at’, g) 

A s (’ta’,c/) 

At(’ta’,g) 

A 3 (’ata’, g) 

A t (’ata’, g) 

0 

{3} 

{3} 

{5} 

{4} 


{4} 

1 

2 

{4} 


{5} 


{5,5} 


3 








Sparse Dynamic Programming Approach. It is built on the fact that in 
many cases, most of the entries of the DP matrix are zero and do not contribute 
to the result. Rousu and Shawe-Taylor [7] have proposed a solution using the 
sparse dynamic programming technique to avoid unnecessary computations. To 
do so, two data structures were proposed: the first one is a range sum tree, which 
is a B-tree, that replaces the DP p matrix. It is used to return the sum of n values 
within an interval in O(logn) time. The second one is a set of match lists instead 
of K p matrix. L q (i) = {(j 1 ,K^(s(l : i),t( 1 : j 1 )),(j 2 ,K^(s(l : i),t( 1 : j 2 )),---} 
where K^(s(l : i),t( 1 : j)) = X m ~' l+n ~^ (s( 1 : t),f( 1 : j)). This dummy gap 
weight \ m - z + n -j allows to address the problem of scaling the kernel values as 
the computation progress. Consequently the recursion ©> becomes: 

Kp {sa, tb) = [a = b\\ 2 ^ ^ K%_ i(s(l : i),t( 1 : j)). (4) 

<<M 1*1 


and the separate dynamic programming table ([3]) can be expressed as follows: 

= (s) 

i<k j<i 

Thereafter, the authors devise a recursive version of 

DP P {k , l) = DPp(k - 1 , l) + : *), t(l : j)). (6) 

j<l 


This can be interpreted as reducing the evaluation of an orthogonal range query 
([5]) to an evaluation of a simple range query multiple times as much as the number 
of lines of the K p matrix. 

To evaluate efficiently a range query, the authors use a range-sum tree to 
store a set S = {(j, Vj)} C {1,... n} x R of key-value pairs. A range-sum tree is 
a binary tree of height h = [log n] where each node in depth d = 0,1,... ,h — 1 
contains a key j with a sum of values in a sub range [j — 2 h ~ d + 1 ,j}. The root 
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is labeled with 2 h , the left child of a node j is j — j /2 and the right child if it 
exists is j + j/2. Odd keys label the leaves of the tree. 

To compute the range sum of values within an interval [1, j] it suffices to 
browse the path from the node j to the root and sum over the left subtrees as 
follows: 


Rangesum([l, j]) — Vj + E Vh■ 

h£ Ancestor s(j) / h<j 


Moreover to update the value of a node j, we need to update all the values of 
parents that contain j in their subtree (h £ Ancestors(j)/h > j). These two 
operations are performed in 0(log n)time because we traverse in the worst case 
the height of the tree. 

For the sparse dynamic programming algorithm (Algorithm [2]) the range-sum 
tree is used incrementally when computing ©>■ So that when processing the 
match list L p (k) the tree will contain the values Vj that satisfy K p _ 1 (s( 1 : 

i),t( 1 : j))i 1 < j < l- Hence the evaluation of ([6J) is performed by involving a 
one dimensional range query: 


i 

Rangesum([l, j]) = Vj 
j= i 

k i 

= EE A t-i( s ( 1: : -?)) 

*=i j = i 
= DPp{k,l). 


Concerning the cost of computation of this approach, the set of match lists is 
created on 0{m + n + |A| + |Ti|) time and space, while the kernel computation 
time is 0{p\Li\ logn), knowing that |Ti| > \L 2 \ > ... > \L p \. 

To illustrate the mechanism of the sparse dynamic programming algorithm, 
Figure [Z] depicts the state of the range-sum tree when computing ATf(s,t). 
Initially the set of match lists is created as follows: 


£1(1) = 0 

^(2) = ((2, A 7 ), (4, A 5 )) 
L 1 (3) = ((3,A 5 )) 

L 1 (4) = ((3,A 4 )) 

^(5) = ((2, A 4 ), (4, A 7 )). 
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A 7 + 2A 5 + 2A 4 + A 2 



Fig. 4. The state of the range-sum tree when computing K 2 (s, t) 


Meanwhile maintaining the range-sum tree, the algorithm update the set of 
match lists as presented below: 

L*( 1) = () 

L 2 ( 2 ) = () 

L 2 ( 3) = ((3,A 7 )) 

L 2 (4) = ((3,A 7 )) 

L 2 (5) = ((4,A 7 + A s + A 4 )). 


Finally, summing the values of the updated match list after discarding the dummy 
weight gives the kernel value K 2 (s,t ): 

K 2 (s,t ) = A 7 • A" 9+3+3 + A 7 • a _9+4+3 + (A 7 + A 5 + A 4 ) • A" 9+5+4 
= A 7 + 2A 5 + 2A 4 . 


4 Geometric based Approach 

Looking forward to improving the complexity of SSK, our approach is based 
on two observations. The first one concerns the computation of K^{s,t) that is 
required only when S| s | = t\ t \. Hence, we have kept only a list of index pairs of 
these entries rather than the entire suffix table, L(s,t) = {(i,j) '■ Si =tj}. 

In the rest of the paper, while measuring the complexity of different compu¬ 
tations, we will consider, |L|, the size of the match list L(s,t) as the parameter 
indicating the size of the input data. 

The complexity of the naive implementation of the list version is 0(p\L\ 2 ), 
and it seems not obvious to compute Kp(s,t ) efficiently on a list data structure. 
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Algorithm 2: Sparse Dynamic SSK computation 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 


Input: Strings s and t,the length of the subsequence p , and the decay penalty A 
Output: Kernel value K p (s,t) = K 
m 4— length(s) 
n 4— length(t) 

Creation of the set of match lists L\ 

for q = 2:p do 

Rangesum(l : n) 4— 0 (Initialization of the range-sum tree) 

for i = l:m do 

foreach (j h ,v h ) G L,-i(i) do 
S 4— Rangesum[l,jh — 1] 
if S > 0 then 
|_ appendlist(Lq(i), (jh, S)) 


11 

12 


/* Update of the range-sum tree 
foreach (jh,v h ) G Z/ 9 _i(i) do 
|_ update(Rangesum, ( jh,Vh )) 


*/ 


13 

14 

15 

16 


/* Computation of the kernel value for the final level 

K 4- 0 

for i = l:m do 

foreach ( jh,Vh ) G L p (i) do 
|_ K 4— K + v h \~ m ~ n+i+ih 


*/ 


In order to address this problem, we have made a second observation that the 
suffix table can be represented as a 2-dimensional space (plane) and the entries 
where sp| = ty\ as points in this plane as depicted in Fig. [5] In this case, the 
match list generated is 

L(s, t) = {A, B, C, D, E, F} = {(2,2), (2,4), (3,3), (4,3), (5,2), (5,4)}. 

With a view to improving the computation of the SSK, it is easy to perceive 
from Fig. [5] that the computation of © can be interpreted as orthogonal range 
queries. In this respect, we have used a layered range tree (LRT) in T|. But the 
LRT data structure reports all points that lie in a specific range query. However, 
for the SSK computation we require only the sum of values of the reported points. 

To achieve this goal, we extend the LRT with the aggregate operations, 
in particular the summation one. Hence, a novel data structure was created, 
for instance a Layered Range Sum Tree (LRST). A LRST is a LRT with two 
substantial extensions to reduce the range sum query time from 0(log \L\ + k) 
to 0(log \L\) where k is the number of reported points in the range. 

The first extension consists to substitute the associated data structures T aS soc 
in the LRT with new associated data structures T'assoc where each entry i 
contains a key-value pair ( j,psj ), psj = 1 v k is the partial sum of T aS soc in 

the position i. This extension is made to compute the range sum of Tassoc within 
[i,j] in 0(1) time as follows : Rangesum[i, j] = T' a ssoc[j] - T' assoS - !]• 
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Fig. 5. Representation of the suffix table as a 2-dimensional space 


The second extension involves the fractional cascading technique. Let T' assoc 1 
and T’associ be two sorted arrays that store partial sums of Tassoci and T aS soc2 
respectively. Suppose that we want to compute the range sum within a query 
q = [2/1 3 3 / 2 ] in Tassoci and T asS oc 2 - We start with a binary search with 2/1 in 
T'associ to find the smallest key larger than or equal 2/1- We make also an other 
binary search with y 2 in T'associ to find the largest key smaller than or equal 
2 / 2 - If an entry T’associ [*] stores a key y,; then we store a pointer to the entry in 
T'assoc2 with the smallest key larger than or equal to yt, say small pointer, and 
a second pointer to the entry in T' a ssoc2 with the largest key smaller than or 
equal to y*, say large pointer. If there is no such key(s) then the pointer(s) is 
(are) set to nil. 

It is easy to see that our extensions does not affect neither the space nor the 
time complexities of the LRT construction. So according to the analysis of of 
computational geometry algorithms, our LRST requires 0(\L\ log \L\) storage 
and can be constructed in 0(\L\ log |L|) time. This leads to the following lemma. 

Lemma 1. Let s and t be two strings and L(s,t) = {( i,j ) : s, = tj} the match 
list associated to the suffix version of the SSK. A Layered range sum tree (LRST) 
for L(s,t) requires 0(|L| log |L|) storage and takes 0(|L| log |L|) construction 
time. 

We can now exploit these extensions to compute efficiently the range sum inherent 
to q — [ 2 / 1 , 2 / 2 ] ffi Tassoci and Tassoc 2 ■ Let T associ[^i] and T associ^d] be the 
results of the binary search with yi and 2/2 respectively in T'associ■ So the 
Rangesum(yi,y 2 ) = T' a ssoci[h] - T' a ssoci[ii - 1] in Tassoci takes 0(log|L|) 
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time. To compute the range sum in Tassoci we avoid the binary searches. We 
consider first the entry T' asS oc2 [j\ ] pointed by the small pointer of T'assod [*i], 
it contains the smallest key from T' aS soc2 larger than or equal to y \. The second 
entry is T' aS soc2 [j' 2 ] pointed by the large pointer of T' assoc 1 ^ 2 ], it contains the 
largest key from T'assoc 2 smaller than or equal to y 2 . Finally the range sum within 
[yi, 2 / 2 ] in Tassoc 2 is given by Rangesum(yi,y 2 ) = V assoc 2 [. 72 ] T assoc 2 \.'j [ 1] 
and it takes 0 ( 1 ) time. 


Algorithm 3: Geometric SSK computation 


Input: Strings s and f,the length of the subsequence p, and the decay penalty A 
Output: Kernel values K q (s, t) = K(q) : q = 1,... ,p 

1 m 4— length(s) 

2 n 4— length(t) 

3 Creation of the initial match list L 

/* Computation of K\(s,t) */ 

4 foreach ((i,j),v) £ L do 

5 L K W K[ 1] + v ■ X i+j 


/* Computation of K q (s, t) : q m 2,... ,p 

6 for q = 2:p do 

7 Building of the LRST corresponding to the match list L 

8 foreach (( i,j),v ) £ L do 

/* Preparing the range query for the entry (i,j) 

9 rq •<— [(0| — oo) : (* — 1| + oo)] x [(0| — oo) : (j — 1| + oo) 

10 result 4— Rangsum[rq\ 

11 if result > 0 then 

12 K[q] = K[q\ + result ■ A l+J 

13 appendlist(neu)L, ((i, j), result)) 

14 L 4— newL 


*/ 


*/ 


For our geometric approach (Algorithm [3]) we will use the LRST to evaluate 
the SSK. We start by the creation of the match list L(s,t) = {((i,j),K^(s( 1 : 

: j ))) : Si = tj} where K^(s( 1 : i),t(l : j)) = K^(s{\ : *),t(l : j)). 

This trick is inspired from [7] to make the range sum results correct. Thus the 
recursion © becomes as follows: 

Kp (sa, tb) = [a=b\J2 x ' +3K $- i( s ( 1 : l )^ 1 : i))- ( 7 ) 

*<|s| j<\t\ 

In order to construct efficiently the match list we have to create for each charac¬ 
ter c £ £ a list 1(c) of occurrence positions (c = Si) in the string s. Thereafter, for 
each character tj £ t we insert key-value pairs ((i,j),K^(s(l : i),t( 1 : j))) in the 
match list L(s, t) corresponding to I(tj). This process takes 0(|s| + |t| +1 S\ + \L\) 
space and 0(|s| + |A|-|-|L|) time. For example, the match list for our running exam¬ 
ple is L(s, t) = {((2,2), A 7 ), ((2,4), A 5 ), ((3,3), A 5 ), ((4,3), A 4 ), ((5,2), A 4 ), ((5,4), A 2 ) 
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Once the initial match list created, we start computing the SSK for the 
subsequence length p = 1. This computation doesn’t require the LRST ; it 
suffices to traverse the match list and sum over its values. For length subsequence 
q > 1 we will first create the LRST corresponding to the match list, afterward for 
each item ((fc, 2), A'^(s(l : k),t( 1 : l ))) we invoke the LRST with the query rq = 
[0, k — 1] x [0, Z — 1]. This latter return the range sum within rq: Rangesum(rq) = 
Y,j<i (K^(s(l : i),t( 1 : j))). If Rangesumfrq) is positive then insert the 
key-value in a new match list for the level q+ 1 and summing the Rangesum(rq) 
to compute the SSK at the level q. At each iteration, we have to create a new 
LRST corresponding to the new match list until achieving the request subsequence 
length p. 

We recall that in our case, we use composite numbers instead of real numbers, 
see section ( | 2 . 5 | ). In such situation, we have to transform the range query [ x\ : 
£2] x [2/1 : 2/2] related to a set of points in the plane to the range query [(aq| — 00) : 
(X2I + 00)] x [(2/1| — 00) : (2/2 1 + 00)] related to the composite space. 

Using our geometric approach, the range sum query time becomes 0 (log |L|). 
For the computation of Kp(s,t) we have to consider \L\ entries of the match list. 
The process iterates p times, therefore, we get a time complexity of 0(p\L\ log |L|) 
for evaluating the SSK. This result combined to that of Lemma. [T| lead to 
the following theorem that summarizes the result of our proposed approach to 
compute SSK. 

Theorem 2. Let s and t be two strings and L(s,t) = {( i,j ) : Si = tj} the match 
list associated to the suffix version of the SSK. A layered range sum tree requires 
0(\L\ log |L|) storage and it can be constructed in 0(\L\ log |L|) time. With these 
data structures, the SSK of length p can be computed in 0(p\L\ log \L\)). 

To compute /V 2 (s,t), for our running example, we have to invoke the range sum 
on the LRST at the step p= 2 represented by Fig j6] The SSK computation is 
performed by summing over all the range sums correponding th the entries of 
the match list as follows: I\ 2 {s,t) = Rangesum[{ 0| — 00) : (1| + 00)] x [(0| — 
00) : (1| + 00)] + Rangesum[( 0| — 00) : (1| + 00)] x [(0| — 00) : (3| + 00)] + 
Rangesum{(0\ — 00) : (2| + 00)] x [(0| — 00) : (2| + 00)] + Rangesum[( 0| — 00) : 
(3| + 00)] x [(0| — 00) : (2| + 00)] + Rangesum[(0\ — 00) : (4| + 00)] x [(0| — 00) : 
(1| + 00)] + Rangesum[( 0| — 00) : (4| + 00)] x [(01 — 00) : (3| + 00)]. 

To describe how this can be processed, we deal by the range sum of the query 
[(0| — 00) : (4 1 + 00)] x [(0| — 00) : (3| + 00)]. At the associate data structure 
corresponding to the split node (3|3) of Figjbjwe find the entries (2|2) and (3|4) 
whose y — coordinates are the smallest one larger than or equal to ( 0 | — 00) and 
the largest one smaller or equal to (3| + 00) respectively. This can be done by 
binary search. Next, we look for the nodes that are below the split node (3|3) and 
that are the right child of a node on the search path to ( 0 | — 00) where the path 
go left, or the left child of a node on the search path to (4| + 00) where the path 
go right. The collected nodes are (3|3), (2|2), (4|3) and the result returned form 
the associated data structures is A -5 + A -4 + A -2 . This is done on a constant 
time by following the small and large pointers form the associated data structure 
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O The tree with circle nodes is a BBST on ^-coordinates. 

□ The arrays with y-coordinates play the role of the associated BBST. 

—» Small pointers of the extended fractional cascading (null pointers are omited). 
—> Large pointers of the extended fractional cascading (null pointers are omited). 


Fig. 6. The state of the layered range sum tree for the running example at the step 
p = 2 with an illustration of the extended fractional cascading (only between two levels). 


of the split node. By the same process we obtain the following results of the 
invoked range sums: 

Rangesum[{ 0| — oo) : (1| + oo)] x [(0| — oo) : (1| + oo)] = 0 

Rangesum[( 0| — oo) : (1| + oo)] x [(0| — oo) : (3| + oo)] = 0 

Rangesum[( 0| — oo) : (2| + oo)] x [(0| — oo) : (2| + oo)] = A~ 2 

Rangesum[( oj — oo) : (3| + oo)] x [(0| — oo) : (2| + oo)] = A~ 2 

Rangesum[(0\ — oo) : (4| + oo)] x |(0| — oo) : (lj + oo)] = 0 

After rescaling the returned values by the factor X z+: > we obtain the value of 

K 2 (s,t) = A" 2 • A 3+3 + A" 2 • A 4+3 + (A" 5 + A" 4 + A" 2 ) • A 5+4 = 2A 4 + 2A 5 + 

A 7 . While invoking the range sums we will prepare the new match list for 

the next step. In our case the new match list contains the following matchs : 

{((3,3), A" 2 ), ((4, 3), A" 2 ), ((5, 2), A -5 + A" 4 + A" 2 )}. 

5 Experimentation 

In this section we describe the experiments that focus on the evaluation of our 
geometric algorithm against the dynamic and the sparse dynamic ones. Thereafter, 
these algorithms are referenced as Geometric, Dynamic and Sparse respectively. 
We have discarded the trie-based algorithm from this comparison because it is 
an approximate algorithm on the one hand, on the other hand in the preliminary 
experiments conducted in |7] it was significantly slower than Dynamic and Sparse. 

To benefit from the empiric evaluation conducted in [7], we tried to keep the 
same conditions of their experiments. For this reason, we have conducted a series 
of experiments on both synthetically generated and on newswire article data on 
Reuter’s news articles. 
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We ran the tests on Intel Core i7 at 2.40 GHZ processor with 16 GB RAM 
under Windows 8.1 64 bit. We implemented all the tested algorithms in Java. For 
the LRST implementation, we have extended the LRT implementation available 
on the page https://github.com/epsilony/ 

5.1 Experiments with synthetic data 

These experiments concern the effects of the string length and the alphabet 
size on the efficiency of the different approaches and to determine under which 
conditions our approach outperforms. 

We randomly generated string pairs with different lengths (2,4,... 8192) over 
alphabets of different sizes (2,4,... 8192). To simplify the string generation, we 
considered string symbols as integer in [1, alphabet size]. For convenience of data 
visualization, we have used the logarithmic scale on all axes. To perform accurate 
experiments, we have generated multiple pairs for the same string length and 
alphabet size and for each pair we have took multiple measures of the running 
time with a subsequence length p = 10 and a decay parameter A = 0.5. This being 



Fig. 7. Running Time of the Geometric algorithm on synthetic data. 


said, Fig. [7] reveals, for our geometric approach, an inverse dependency of the 
running time with the alphabet size. However, for an alphabet size the running 
time is proportional to the string length. Figure [8] shows experimental comparison 
of the performance of the proposed approach against Dynamic. Note that the rate 
of 100% indicates that the two algorithms deliver the same performances. For the 
rates less than 100% our approach outperforms, it is the case for strings based on 
medium and large alphabets excepting those having short length (say alphabet 
size great than or equal 256, where the string length exceeds 128 characters). For 
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Fig. 8 . Relative running Time on synthetic data: Geometric/Dynamic. 


short strings and also for long strings based on small alphabets, Dynamic excels. 
It remains to present results of the comparison experiment with Sparse which 
share the same motivations with our approach. Rousu and Shawe-Taylor 7] state 
that with long strings based on large alphabets their approach is faster. Figure [9] 
shows that in these conditions our approach dominates. Moreover, our approach 
is faster than the Sparse one for long strings and for large alphabets absolutely, 
but gets slower than Sparse for short strings based on small alphabets. 


5.2 Experiments with newswire article data 

Our second experiments use the Reuters-21578 collection to evaluate the speed of 
Geometric against Dynamic and Sparse on English articles. We created a dataset 
represented as sequences of syllables by transferring all the XML articles on to 
text documents. Thereafter, the text documents are preprocessed by removing 
stop words, punctuation marks, special symbols and finally word syllabifying. 
We have generated 22260 distinct syllables. As in the first experiment, each 
syllable alphabet is assigned an integer. To treat the documents randomly, we 
have shuffled this preliminary dataset. 

For visualization convenience, while creating document pairs, we have ensured 
that their lengths are close. Under this condition, we have collected 916 pair 
documents as final dataset. 

To compare the candidate algorithms, we have computed the SSK for each 
document pair of the data set by varing the subsequence length form 2 to 20. 
Figure [lO] and Figure [Tl] depict the clusters of documents where Geometric is 
faster than Dynamic and Sparse respectively. A document pair (s, t) is plotted 
according to the inverse match frequency (X-axis) and the document size (Y-axis). 
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Fig. 9. Relative running Time on synthetic data: Geometric/Sparse. 
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The inverse match frequency is given by: |s||t|/|L|, it plays the role of the alphabet 
size 1171 inherent to the documents s and t. The document size is calculated as 
the arithmetic mean of the document pair sizes, it plays the role of the string 
length. Each cluster is distinguished by a special marker that corresponds to the 
necessary minimum subsequence length to make Geometric faster than Dynamic 
or Sparse. For the cluster marked by black diamonds, p < 5 is sufficient. The 
length 5 < p < 10 is required for the cluster marked by blue filled squares. For 
the cluster marked by green circles 10 < p < 20 is required and the last cluster 
marked by plus signs p > 20 is needed. We can distinguish three cases in Fig. [lOj 
The first one arises when the inverse match frequency is weak (small alphabet 
size), that is to say for dense matrix. In this case, we require important values 
of the subsequence length ( p > 10 for small documents and p > 20 for larger 
ones) to make Geometric faster than Dynamic. The second case concerns good 
inverse match frequencies (large alphabet size) corresponding to sparse matrix. 
In this case, small values of the subsequence length (p < 5 ) suffice to make 
Geometric faster than Dynamic. The third case appear for moderate inverse 
match frequency (medium alphabet size), the values of p that makes Geometric 
faster than Dynamic depend on the document size. The large document size 
the large p is required. The results of the comparison between Geometric and 
Sparse on newswire article data are depicted in Fig. El We can discuss 3 cases: 
The first case emerge when the document size becomes large and also for good 
inverse match frequency. In this case small values of the subsequence length 
(p < 5 ) suffice to make Geometric faster than Sparse. The second case appear for 
small documents and bad inverse match frequencies. The necessary subsequence 
length must be important ( p > 10 for very small documents and p > 20 for the 
small ones). The third case concerns modurate inverse frequencies. In this case 
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Fig. 10. Clusters of document pairs where Geometric is faster than Dynamic according 
to the subsequence length p. 


the value of the subsequence length that makes Geometric faster than Sparse 
depends on the document size except large sizes which fall in the first case. 


5.3 Discussion of the experiment results 

In step with the results of the two experiments, it is easy to see that the algorithms 
behave essentially in the same way both on synthetically generated data and 
newswire article data. These results reveal that our approach outperforms for 
large alphabet size except for very small strings. Moreover, regarding to the 
Sparse, Geometric is competitive for long strings. 

We can argue this as follows: first, the alphabet size and the string length 
affect substantially the kernel matrix form. Large alphabets can reduce potentially 
the partially matching subsequences especially on long strings, giving rise to 
sparse matrix form. Consequently, great number of data stored in the kernel 
matrix do not contribute to the result. In the other cases, for dense matrix, our 
approach can be worse than Dynamic by at most Log\L\ factor. 

On the other hand, The complexities of Geometric and Sparse differ only by 
the factors Log\L\ and Log n. The inverse dependency of \L\ and X 1 goes in favor 
of our approach. Also, the comparisons conducted on our datasets give evidence 
that for long strings \L\ « n, remembering that the size of the match list 
decrease while the SSK execution progress. Moreover, to answer orthogonal range 
queries, Sparse invoke one dimensional range query multiple times. Whereas, 
Geometric mark good scores by using orthogonal range queries in conjunction 
with the fractional cascading and exploit our extension of the LRT data structure 
to get directly the sum within a range. 
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Fig. 11. Clusters of document pairs where Geometric is faster than Sparse according 
to the subsequence length p. 


6 Conclusions and further work 

We have presented a novel algorithm that efficiently computes the string subse¬ 
quence kernel (SSK). Our approach is refined over two phases. We started by 
the construction of a match list L(s,t ) that contains, only, the information that 
contributes in the result. Thereafter, in order to compute, efficiently, the sum 
within a range for each entry of the match list, we have extended a layered range 
tree to be a layered range sum tree. The Whole task takes 0{p\L\ log \L\) time 
and 0{\L \log |L|) space, where p is the length of the SSK and |L| is the initial 
size of the match list. 

The reached result gives evidence of an asymptotic complexity improvement 
compared to that of a naive implementation of the list version 0(p\L\ 2 ). The 
experiments conducted both on synthetic data and newswire article data attest 
that the dynamic programming approach is faster when the kernel matrix is 
dense. This case is achieved on long strings based on small alphabets and on 
short strings. Furthermore, recall that our approach and the sparse dynamic 
programming one are proposed in the context where the most of the entries of the 
kernel matrix are zero, i.e. for large-sized alphabets. In such case our approach 
outperforms. For long strings our approach behave better than the sparse one. 

This well scaling of the proposed approach with document size and alphabet 
size could be useful in very tasks of machine learning on long documents as 
full-length research articles. 

A noteworthy advantage is that our approach can be favorable if we assume 
that the problem is multi-dimensional. In terms of complexity, this can have 
influence the storage and the running time, only, by a logarithmic factor. Indeed, 
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the layered range sum tree needs 0{\L\ log d_1 |L|) storage and can compute the 
sum within a rectangular range in 0(log d_1 |L|), in a d-dimensional space. 

At the implementation level, great programming effort is supported by well- 
studied and ready to use computational geometry algorithms. Hence, the emphasis 
is shifted to a variant of string kernel computations that can be easily adapted. 

Finally, it would be very interesting if the LRST can be extended to be 
a dynamic data structure. This can relieve us to create a new LRST at each 
evolution of the subsequence length. An other interesting axis consists to combine 
the LRST with the dynamic programming paradigm. We believe that using 
rectangular intersection techniques seems to be a good track, though this seems 
to be a non trivial task. 
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